Fast Search Methods for Biological Sequence Databases

Authors

Sumit Ganguly

Jerry Leichter

Michiel Noordewier

Abstract

Biology researchers have a pressing need for data management technologies that make the storage and retrieval of DNA and protein sequence data accurate and efficient. The volume of data generated by DNA sequencing is already massive and will continue to grow rapidly. Even if the current sequence databases are adequate today, they will almost certainly become inadequate once far more sequence data has been determined. Future research in sequence databases therefore needs to focus on the organization of information, so that the volume of data that must be searched does not grow linearly with the volume of sequence data being discovered. We propose to develop an index structure and retrieval system called PROXIMAL for biological sequence databases which promises to be efficient and general. This organization of the databases will complement other current efforts at sequence comparison and analysis by providing an infrastructure in which other methods can be used to efficiently locate desired sequences. Our method relies on the use of reference strings to partition the database of sequences. It is efficient because the use of multiple reference strings for any given distance measure greatly reduces the number of sequences that must be examined, allowing us to quickly locate sequences based on a precomputed metric. It is general because multiple distance measures can be used. These include at least differing gap and mismatch weights for the basic edit distance calculation, or entirely different models of mutation. The only requirement is that there is a metric structure: mainly, that the calculations satisfy the triangle inequality. This is a weak requirement that is satisfied by many interesting measures, including those currently in wide use for sequence comparison.
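The reference-string idea in the abstract can be illustrated with a minimal sketch. This is not the PROXIMAL implementation, whose details are not given here; it only shows how precomputed distances to reference strings, combined with the triangle inequality, let a range search skip sequences without comparing them to the query. The function names `build_index` and `range_search` are hypothetical, and unit-cost edit distance is assumed as the metric.

```python
from typing import Callable, List, Sequence, Tuple

Metric = Callable[[str, str], int]


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance; it is a metric, so the
    triangle inequality holds (assumed distance measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def build_index(database: Sequence[str], references: Sequence[str],
                dist: Metric = edit_distance) -> List[List[int]]:
    """Precompute the distance from every sequence to every reference string."""
    return [[dist(seq, ref) for ref in references] for seq in database]


def range_search(query: str, radius: int, database: Sequence[str],
                 references: Sequence[str], index: List[List[int]],
                 dist: Metric = edit_distance) -> Tuple[List[Tuple[str, int]], int]:
    """Return (hits, examined): all sequences within `radius` of `query`,
    and how many sequences needed a full distance computation."""
    q_to_ref = [dist(query, ref) for ref in references]
    hits: List[Tuple[str, int]] = []
    examined = 0
    for seq, row in zip(database, index):
        # Triangle inequality: |d(q, r) - d(s, r)| <= d(q, s) for every
        # reference r, so the largest such gap is a lower bound on d(q, s).
        lower_bound = max(abs(qr - sr) for qr, sr in zip(q_to_ref, row))
        if lower_bound > radius:
            continue  # cannot be within the radius; skip the comparison
        examined += 1
        d = dist(query, seq)
        if d <= radius:
            hits.append((seq, d))
    return hits, examined
```

The index is built once per distance measure, so the per-query cost is one comparison per reference string plus one per sequence that survives the lower-bound filter. Any other measure that satisfies the triangle inequality could be passed as `dist` in place of `edit_distance`.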

Similar resources

Protein Databases

Proteins are sources of many peptides with diverse biological activity. Some of them are considered as valuable components of foods and drug targets with desired and designed biological activity. We are now entering an era rich in biological data in which the field of bioinformatics is poised to exploit this information in increasingly powerful ways. There are currently many databases all over ...

gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...

Optimal Protein Encoding

A common task in Bioinformatics is the search of sequence databases for matching sequences. In protein sequence databases, searching is hindered by both the increased amount of data and the complexity of sequence similarity metrics. Protein similarity is not simply a matter of character matching, but rather is determined by a matrix of scores assigned to every match and mismatch [5]. One strate...

ODM BLAST: Sequence Homology Search in the RDBMS

Performing sequence homology searches against DNA or protein sequence databases is an essential bioinformatics task. Past research efforts have been primarily concerned with the development of sensitive and fast sequence homology search algorithms outside of the relational database management system (RDBMS). Oracle Data Mining (ODM) BLAST enables BLAST to be performed in an RDBMS. ODM BLAST reli...

Cluster-preserving embedding of proteins by

Similarity searching in protein sequence databases is a standard technique for biologists dealing with a newly sequenced protein. Exhaustive search in such databases is prohibitive because of the large sizes of these databases and because pairwise comparisons are slow. Heuristic techniques, such as FASTA and BLAST, are useful because they are fast and accurate, though it has been shown that exha...
